Objective:

+ Visualization and intuition of Principal Component Analysis.
+ Comprehending clustering as an unsupervised learning algorithm.
+ Clustering using k-means, hierarchical, and density methodologies.
+ Creating cluster trees, heatmaps, and silhouettes.
+ Segmenting data by a similarity criterion using distance metrics.

Principal Component Analysis (PCA)


PCA Mathematical Framework:

+The total number of dimensions in a dataset is given by its number of variables: if our data has 35 features, then our data has 35 dimensions. At that dimensionality it is difficult to decipher the substantive patterns within a multivariate dataset.

+To offset this issue, PCA creates new variables called components. Each component is a linear combination of all the original variables, which in our scenario are the 35 features.

+New Variables:

Variable 1: Component 1

\[ y_1=\beta_{11}x_1+\beta_{12}x_2+\beta_{13}x_3+\dots+\beta_{1,35}x_{35}\]

Variable 2: Component 2

\[ y_2=\beta_{21}x_1+\beta_{22}x_2+\beta_{23}x_3+\dots+\beta_{2,35}x_{35}\]

+Mathematically, in comparison to the original large set of variables, the new and smaller set of variables, named principal components, extracts credible and substantive information by numerically and visually quantifying the maximal variability of the data. Principal components are linear combinations of the original variables, each orthogonal to the subsequent components, and together they capture 100% of the variability in the data in decreasing order of precedence.

+The first principal component captures the largest variability, depicted as a direction along the PC1 axis, whereas the second principal component captures the second-largest variability along the PC2 axis.

+Numerically, within the PCA framework, variability is measured by eigenvalue decomposition of the covariance matrix or by singular value decomposition of the data matrix, after initial normalization of the data.

+PCA can also be used in conjunction with other unsupervised learning techniques, such as k-means clustering or hierarchical clustering, to showcase inherent data structures.
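As a minimal sketch of this decomposition in base R, prcomp() performs the singular value decomposition after centering and scaling; the built-in USArrests data stands in here for the 35-feature scenario described above:

```r
# PCA via base R's prcomp(), which uses SVD internally.
# USArrests is a built-in dataset standing in for the 35-feature example.
data(USArrests)
pca_fit <- prcomp(USArrests, scale. = TRUE)

# Each column of the rotation matrix holds the beta coefficients of one
# principal component's linear combination of the original variables.
print(pca_fit$rotation)

# Proportion of variance captured by each component, in decreasing order;
# the proportions sum to 1 (100% of the variability).
variance_explained <- pca_fit$sdev^2 / sum(pca_fit$sdev^2)
print(variance_explained)
```

factoextra's fviz_pca_ind(), used later in this document, wraps this same prcomp() output for visualization.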


Clustering

Clustering with k-means

Euclidean Distance as Similarity Metric

\[WCSS=\sum_{k=1}^{K}\sum_{x_i\in C_k}\lVert x_i-\mu_k\rVert^2\]
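The WCSS objective can be verified directly against the tot.withinss component returned by kmeans(); a sketch on the built-in iris measurements (an illustrative stand-in for the heart data used below):

```r
# Recompute WCSS by hand and compare it to kmeans()'s own total.
set.seed(123)
x <- scale(iris[, 1:4])
fit <- kmeans(x, centers = 3, nstart = 25)

# WCSS: for each cluster k, sum the squared distances of its members
# to the cluster centroid mu_k, then sum over clusters.
wcss <- sum(sapply(1:3, function(k) {
  members  <- x[fit$cluster == k, , drop = FALSE]
  centroid <- fit$centers[k, ]
  sum(sweep(members, 2, centroid)^2)
}))

all.equal(wcss, fit$tot.withinss)  # the two values agree
```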


Additional Distance Metrics

1. **Manhattan Distance**

\[d_m(x,y)=\sum_{i=1}^{n}|x_i-y_i|\]

2. **Pearson Correlation Distance**

\[d_c(x,y)=1-\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}\]

3. **Cosine Distance**

\[d_{cos}(x,y)=1-\frac{|\sum_{i=1}^{n}x_iy_i|}{\sqrt{\sum_{i=1}^{n}x_i^2\sum_{i=1}^{n}y_i^2}}\]

4. **Spearman’s Correlation Distance**, where the \(x_i\) and \(y_i\) values are replaced by their ranks:

\[d_s(x,y)=1-\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2\sum_{i=1}^{n}(y_i-\bar{y})^2}}\]
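A base-R sketch of the metrics above for two small hypothetical vectors:

```r
# Two illustrative vectors (hypothetical values).
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 7, 11)

# Manhattan distance: sum of absolute coordinate differences.
d_manhattan <- sum(abs(x - y))

# Pearson correlation distance.
d_pearson <- 1 - cor(x, y, method = "pearson")

# Cosine distance.
d_cosine <- 1 - abs(sum(x * y)) / sqrt(sum(x^2) * sum(y^2))

# Spearman correlation distance: Pearson computed on the ranks.
d_spearman <- 1 - cor(x, y, method = "spearman")

# dist() agrees with the manual Manhattan computation.
stats::dist(rbind(x, y), method = "manhattan")
```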

library(readxl)
library(ggplot2)
library(dplyr)
library(corrplot)
library(caret)
library(cluster)
library(factoextra)
library(magrittr)
library(fpc)

setwd('C:\\projects\\MachineLearning')

heartdata_initial<-read_excel("heart.xlsx")

heartdata_initial$Sex<-factor(heartdata_initial$Sex)
heartdata_initial$ChestPainType<-as.factor(heartdata_initial$ChestPainType)
heartdata_initial$FastingBPmorethan120<-as.factor(heartdata_initial$FastingBPmorethan120)
heartdata_initial$ExerciseInducedAngina<-as.factor(heartdata_initial$ExerciseInducedAngina)
heartdata_initial$Thal<-as.factor(heartdata_initial$Thal)
#heartdata_initial$ColoredVessels<-as.factor(heartdata_initial$ColoredVessels)
heartdata_initial$HeartAttackRisk<-as.factor(heartdata_initial$HeartAttackRisk)
heartdata_initial$RestingElectrographicResults<-as.factor(heartdata_initial$RestingElectrographicResults)


summary(heartdata_initial)
##       Age        Sex     ChestPainType RestingBloodPressure  Cholesterol   
##  Min.   :29.00   0: 96   0:143         Min.   : 94.0        Min.   :126.0  
##  1st Qu.:47.50   1:207   1: 50         1st Qu.:120.0        1st Qu.:211.0  
##  Median :55.00           2: 87         Median :130.0        Median :240.0  
##  Mean   :54.37           3: 23         Mean   :131.6        Mean   :246.3  
##  3rd Qu.:61.00                         3rd Qu.:140.0        3rd Qu.:274.5  
##  Max.   :77.00                         Max.   :200.0        Max.   :564.0  
##  FastingBPmorethan120 RestingElectrographicResults MaximumHeartRate
##  0:258                0:147                        Min.   : 71.0   
##  1: 45                1:152                        1st Qu.:133.5   
##                       2:  4                        Median :153.0   
##                                                    Mean   :149.6   
##                                                    3rd Qu.:166.0   
##                                                    Max.   :202.0   
##  ExerciseInducedAngina    Oldpeak         Slope       ColoredVessels   Thal   
##  0:204                 Min.   :0.00   Min.   :0.000   Min.   :0.0000   0:  2  
##  1: 99                 1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000   1: 18  
##                        Median :0.80   Median :1.000   Median :0.0000   2:166  
##                        Mean   :1.04   Mean   :1.399   Mean   :0.7294   3:117  
##                        3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000          
##                        Max.   :6.20   Max.   :2.000   Max.   :4.0000          
##  HeartAttackRisk
##  0:138          
##  1:165          
##                 
##                 
##                 
## 
heartdata<-heartdata_initial[c(1,4,5,8,10,12)]
heartdata_scaled<-scale(heartdata)

# Finding the optimal number of clusters by the elbow method
optimalclusters<-fviz_nbclust(heartdata_scaled,kmeans,method="wss")
print(optimalclusters)

# Running the k-means clustering algorithm.
kmeansdf<-kmeans(heartdata_scaled,2,nstart=25)
print(kmeansdf)
## K-means clustering with 2 clusters of sizes 139, 164
## 
## Cluster means:
##          Age RestingBloodPressure Cholesterol MaximumHeartRate    Oldpeak
## 1  0.6923882            0.3775110   0.2632115       -0.6121313  0.5377989
## 2 -0.5868412           -0.3199636  -0.2230878        0.5188186 -0.4558173
##   ColoredVessels
## 1      0.4827357
## 2     -0.4091480
## 
## Clustering vector:
##   [1] 1 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 1 2 1 2 2 2 1 2 1 1 2 1 2 2 2 2 2 1 2 2
##  [38] 2 1 1 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 2 2 1 1 2 2 2 2 1 2 2 2 1 1 2 2 2 1
## [112] 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2 2 1 2 2 2 1 1 1 2 2
## [149] 2 2 1 1 1 1 2 2 2 2 1 2 2 2 2 2 2 1 1 1 1 1 1 2 2 1 1 2 1 1 2 1 1 1 2 1 1
## [186] 2 1 1 2 2 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 2 1 2 1 1 1 1 1 1 2 1 1
## [223] 1 1 1 1 1 2 1 1 2 1 1 1 1 2 2 1 1 2 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1
## [260] 2 1 2 1 1 2 1 1 1 1 1 2 1 1 2 2 2 1 2 1 1 1 2 1 2 1 1 1 2 1 1 2 1 1 1 2 1
## [297] 2 1 1 2 1 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 830.6576 573.4849
##  (between_SS / total_SS =  22.5 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
# Finding the optimal number of clusters by Gap method
fviz_nbclust(heartdata_scaled,kmeans,nstart=25,method="gap_stat",nboot=50)+labs(subtitle = "Gap Statistic Method")

# Numerical summaries of the clusters
cluster_aggregate<-aggregate(heartdata,by=list(cluster=kmeansdf$cluster),mean)

print(cluster_aggregate)
##   cluster      Age RestingBloodPressure Cholesterol MaximumHeartRate   Oldpeak
## 1       1 60.65468             138.2446    259.9065         135.6259 1.6640288
## 2       2 49.03659             126.0122    234.7012         161.5305 0.5103659
##   ColoredVessels
## 1      1.2230216
## 2      0.3109756
# Compare the above summaries to the dataset grouped by Heart Attack Risk
tapply(heartdata_initial$Age, heartdata_initial$HeartAttackRisk, summary)
## $`0`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    35.0    52.0    58.0    56.6    62.0    77.0 
## 
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    29.0    44.0    52.0    52.5    59.0    76.0
tapply(heartdata_initial$Cholesterol, heartdata_initial$HeartAttackRisk, summary)
## $`0`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   131.0   217.2   249.0   251.1   283.0   409.0 
## 
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   126.0   208.0   234.0   242.2   267.0   564.0
# PAM: Partitioning Around Medoids
# Finding the optimal number of clusters by the silhouette method and describing the cluster characteristics:

fviz_nbclust(heartdata_scaled,pam,method="silhouette")+theme_classic()

pam_clusters<-pam(heartdata_scaled,2)
print(pam_clusters)
## Medoids:
##       ID       Age RestingBloodPressure Cholesterol MaximumHeartRate    Oldpeak
## [1,] 187  0.620304          -0.09258463   0.1299609       -0.2465324  0.3103986
## [2,] 149 -1.141403          -0.66277043  -0.3909653        0.8449247 -0.8953805
##      ColoredVessels
## [1,]      0.2646444
## [2,]     -0.7132490
## Clustering vector:
##   [1] 1 2 2 2 1 1 1 2 1 1 1 2 2 1 1 2 1 1 2 1 1 2 2 1 2 1 1 2 1 1 2 1 2 1 1 2 2
##  [38] 1 1 1 1 2 2 1 2 2 2 2 1 2 2 1 1 2 1 1 2 2 2 1 1 2 2 2 2 2 2 2 2 1 1 2 2 2
##  [75] 2 1 1 1 2 1 2 2 1 1 2 1 1 2 1 1 2 2 1 1 2 1 1 1 1 1 2 1 1 2 2 1 1 2 2 2 1
## [112] 2 1 2 2 2 2 1 2 2 1 2 2 2 2 2 2 1 2 1 1 2 2 2 2 2 1 1 1 1 2 2 2 1 1 1 2 1
## [149] 2 2 1 1 1 1 2 1 2 2 1 2 2 1 2 2 2 1 1 1 1 1 1 2 1 1 1 2 1 1 1 1 1 1 1 1 1
## [186] 2 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
## [223] 1 1 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## [260] 2 1 2 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1 1 1 1 2 1
## [297] 1 1 1 2 1 1 1
## Objective function:
##    build     swap 
## 2.099213 2.064042 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"
pamcluster_aggregate<-aggregate(heartdata,by=list(cluster=pam_clusters$cluster),mean)
pam_clusters$medoids
##            Age RestingBloodPressure Cholesterol MaximumHeartRate    Oldpeak
## [1,]  0.620304          -0.09258463   0.1299609       -0.2465324  0.3103986
## [2,] -1.141403          -0.66277043  -0.3909653        0.8449247 -0.8953805
##      ColoredVessels
## [1,]      0.2646444
## [2,]     -0.7132490
head(pam_clusters$clustering)
## [1] 1 2 2 2 1 1
# Adding the cluster assignments to the original data for any further explorations.

clusterbind_heartdata<-cbind(heartdata_initial,pam_clusters$cluster)
head(clusterbind_heartdata)
##   Age Sex ChestPainType RestingBloodPressure Cholesterol FastingBPmorethan120
## 1  63   1             3                  145         233                    1
## 2  37   1             2                  130         250                    0
## 3  41   0             1                  130         204                    0
## 4  56   1             1                  120         236                    0
## 5  57   0             0                  120         354                    0
## 6  57   1             0                  140         192                    0
##   RestingElectrographicResults MaximumHeartRate ExerciseInducedAngina Oldpeak
## 1                            0              150                     0     2.3
## 2                            1              187                     0     3.5
## 3                            0              172                     0     1.4
## 4                            1              178                     0     0.8
## 5                            1              163                     1     0.6
## 6                            1              148                     0     0.4
##   Slope ColoredVessels Thal HeartAttackRisk pam_clusters$cluster
## 1     0              0    1               1                    1
## 2     0              0    2               1                    2
## 3     2              0    2               1                    2
## 4     2              0    2               1                    2
## 5     2              0    2               1                    1
## 6     1              0    1               1                    1
fviz_cluster(pam_clusters,ellipse.type = "t",ggtheme=theme_classic())

Hierarchical Clustering Analysis (HCA)

**Agglomerative Clustering (AGNES, Agglomerative Nesting)** + This clustering starts with every instance as its own cluster and recursively merges the most similar pair of clusters until a single root cluster remains. This algorithm has a “bottom-up” paradigm.

**Divisive Clustering (DIANA, Divisive Analysis)** + This clustering starts from the root and recursively subdivides clusters according to their heterogeneity, finally leaving every instance as a separate cluster. This algorithm has a “top-down” paradigm.


Linkage Functions:

The main linkage functions are:

  1. **Maximum or Complete Linkage** The distance between two clusters is the maximum value of the pairwise distances between elements of the two clusters.
  2. **Minimum or Single Linkage** The distance between two clusters is the minimum value of the pairwise distances between elements of the two clusters.
  3. **Mean or Average Linkage** The distance between two clusters is the average value of the pairwise distances between elements of the two clusters.
  4. **Centroid Linkage** The distance between two clusters is the distance between their centroids.
  5. **Ward’s Minimum Variance Method** At each step the pair of clusters with the minimum between-cluster distance is merged; this method minimizes the total within-cluster variance.

A good visual and numerical illustrative is provided at the following link:

https://en.wikipedia.org/wiki/Complete-linkage_clustering#Distance_Matrix1
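The linkage functions above can be compared empirically through the cophenetic correlation (computed later in this section for ward.D2 and average); a sketch on the built-in USArrests data:

```r
# Compare linkage functions by how faithfully each tree preserves the
# original pairwise distances (cophenetic correlation).
x <- scale(USArrests)
d <- dist(x, method = "euclidean")

linkages <- c("complete", "single", "average", "centroid", "ward.D2")
coph_cor <- sapply(linkages, function(m) {
  tree <- hclust(d, method = m)
  cor(d, cophenetic(tree))
})
print(coph_cor)
```

Note that hclust()'s centroid method is conventionally applied to squared Euclidean distances; plain distances are used here only for a rough side-by-side comparison.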

library(readxl)
library(ggplot2)
library(dplyr)
library(corrplot)
library(caret)
library(cluster)
library(factoextra)
library(magrittr)
library(fpc)

# Setting the working directory
setwd('C:\\projects\\MachineLearning')

# Reading the data
heartdata_initial<-read_excel("heart.xlsx")

heartdata_initial$Sex<-factor(heartdata_initial$Sex)
heartdata_initial$ChestPainType<-as.factor(heartdata_initial$ChestPainType)
heartdata_initial$FastingBPmorethan120<-as.factor(heartdata_initial$FastingBPmorethan120)
heartdata_initial$ExerciseInducedAngina<-as.factor(heartdata_initial$ExerciseInducedAngina)
heartdata_initial$Thal<-as.factor(heartdata_initial$Thal)
#heartdata_initial$ColoredVessels<-as.factor(heartdata_initial$ColoredVessels)
heartdata_initial$HeartAttackRisk<-as.factor(heartdata_initial$HeartAttackRisk)
heartdata_initial$RestingElectrographicResults<-as.factor(heartdata_initial$RestingElectrographicResults)

# Summarizing the data
summary(heartdata_initial)
##       Age        Sex     ChestPainType RestingBloodPressure  Cholesterol   
##  Min.   :29.00   0: 96   0:143         Min.   : 94.0        Min.   :126.0  
##  1st Qu.:47.50   1:207   1: 50         1st Qu.:120.0        1st Qu.:211.0  
##  Median :55.00           2: 87         Median :130.0        Median :240.0  
##  Mean   :54.37           3: 23         Mean   :131.6        Mean   :246.3  
##  3rd Qu.:61.00                         3rd Qu.:140.0        3rd Qu.:274.5  
##  Max.   :77.00                         Max.   :200.0        Max.   :564.0  
##  FastingBPmorethan120 RestingElectrographicResults MaximumHeartRate
##  0:258                0:147                        Min.   : 71.0   
##  1: 45                1:152                        1st Qu.:133.5   
##                       2:  4                        Median :153.0   
##                                                    Mean   :149.6   
##                                                    3rd Qu.:166.0   
##                                                    Max.   :202.0   
##  ExerciseInducedAngina    Oldpeak         Slope       ColoredVessels   Thal   
##  0:204                 Min.   :0.00   Min.   :0.000   Min.   :0.0000   0:  2  
##  1: 99                 1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000   1: 18  
##                        Median :0.80   Median :1.000   Median :0.0000   2:166  
##                        Mean   :1.04   Mean   :1.399   Mean   :0.7294   3:117  
##                        3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000          
##                        Max.   :6.20   Max.   :2.000   Max.   :4.0000          
##  HeartAttackRisk
##  0:138          
##  1:165          
##                 
##                 
##                 
## 
# Scaling the data
heartdata<-heartdata_initial[c(1,4,5,8)]
summary(heartdata)
##       Age        RestingBloodPressure  Cholesterol    MaximumHeartRate
##  Min.   :29.00   Min.   : 94.0        Min.   :126.0   Min.   : 71.0   
##  1st Qu.:47.50   1st Qu.:120.0        1st Qu.:211.0   1st Qu.:133.5   
##  Median :55.00   Median :130.0        Median :240.0   Median :153.0   
##  Mean   :54.37   Mean   :131.6        Mean   :246.3   Mean   :149.6   
##  3rd Qu.:61.00   3rd Qu.:140.0        3rd Qu.:274.5   3rd Qu.:166.0   
##  Max.   :77.00   Max.   :200.0        Max.   :564.0   Max.   :202.0
heartdata_scaled<-scale(heartdata)
summary(heartdata_scaled)
##       Age           RestingBloodPressure  Cholesterol      MaximumHeartRate 
##  Min.   :-2.79300   Min.   :-2.14525     Min.   :-2.3203   Min.   :-3.4336  
##  1st Qu.:-0.75603   1st Qu.:-0.66277     1st Qu.:-0.6804   1st Qu.:-0.7049  
##  Median : 0.06977   Median :-0.09259     Median :-0.1209   Median : 0.1464  
##  Mean   : 0.00000   Mean   : 0.00000     Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.73041   3rd Qu.: 0.47760     3rd Qu.: 0.5448   3rd Qu.: 0.7139  
##  Max.   : 2.49212   Max.   : 3.89872     Max.   : 6.1303   Max.   : 2.2856
# Creating the dissimilarity matrix using Euclidean distance.
heartdata_dist<-dist(heartdata_scaled,method="euclidean")

# Displaying the distances
head(heartdata_dist)
## [1] 3.412320 2.799080 2.030687 2.870628 1.072887 1.441386
as.matrix(heartdata_dist)[1:7,1:7]
##          1        2        3        4        5        6        7
## 1 0.000000 3.412320 2.799080 2.030687 2.870628 1.072887 1.441386
## 2 3.412320 0.000000 1.187644 2.220144 3.209136 3.053821 2.761474
## 3 2.799080 1.187644 0.000000 1.871547 3.458114 2.140142 2.599257
## 4 2.030687 2.220144 1.871547 0.000000 2.371514 1.936149 1.934928
## 5 2.870628 3.209136 3.458114 2.371514 0.000000 3.390932 1.686193
## 6 1.072887 3.053821 2.140142 1.936149 3.390932 0.000000 1.983073
## 7 1.441386 2.761474 2.599257 1.934928 1.686193 1.983073 0.000000
# The linkage function uses the distance matrix as a proximity metric and pairwise merges the instances, thereby creating larger clusters with every successive iteration. Using the ward.D2 linkage function.


agg_tree_ward<-hclust(d=heartdata_dist,method="ward.D2")

print(agg_tree_ward)
## 
## Call:
## hclust(d = heartdata_dist, method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 303
# Visualizing the dendrogram

fviz_dend(agg_tree_ward,cex=.5)

# Cutting the tree to create 2 clusters and visualizing it.

agg_tree_warddend<-fviz_dend(agg_tree_ward,cex=.5,k=2,palette = "jco")

agg_tree_warddend

# To assess the partition accuracy of the cluster tree (created by hclust()) there should be a strong correlation between the original distance matrix and the object linkage distances, defined as cophenetic distances.

# Calculating Cophenetic Distances

agg_cophenetic<-cophenetic(agg_tree_ward)

head(agg_cophenetic)
## [1] 23.098528 23.098528  7.989358 14.023530  3.844826 14.023530
# Calculating the correlation between Cophenetic distances and original distances for :

cor(heartdata_dist,agg_cophenetic)
## [1] 0.4031523
# Using the average linkage function

agg_tree_average<-hclust(d=heartdata_dist,method="average")

fviz_dend(agg_tree_average,cex=.5)

# Cophenetic Distances

agg_cophenetic<-cophenetic(agg_tree_average)

# correlation between Cophenetic distances and original distances:

cor(heartdata_dist,agg_cophenetic)
## [1] 0.613122
# Cutting the tree into clusters

two_groups<-cutree(agg_tree_ward,k=2)
table(two_groups)
## two_groups
##   1   2 
## 185 118
head(two_groups,n=4)
## [1] 1 2 2 1
fviz_dend(agg_tree_average,k=3,cex=.5,color_labels_by_k =TRUE,rect=TRUE)

fviz_cluster(list(data=heartdata_scaled,cluster=two_groups))

two_groups<-cutree(agg_tree_average,h=2)
table(two_groups)
## two_groups
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
## 83 53 37  5 17 39  4  9 11  8  1  4 18  5  1  3  2  1  1  1
head(two_groups,n=4)
## [1] 1 2 2 1
fviz_dend(agg_tree_ward,k=2,cex=.5,color_labels_by_k =TRUE,rect=TRUE)

fviz_cluster(list(data=heartdata_scaled,cluster=two_groups))

# The cluster package also provides agglomerative and divisive methodologies

# Agglomerative
agnes_cluster<-agnes(x=heartdata_scaled,stand=TRUE,metric = "euclidean",method="ward")

agnes_cluster$ac
## [1] 0.9707085
# pltree() draws the dendrogram directly and returns NULL invisibly,
# so its result does not need to be assigned or printed.
pltree(agnes_cluster, cex = 0.6, hang = -1, main = "Dendrogram of AGNES")
# plot.hclust()
plot(as.hclust(agnes_cluster), cex = 0.6, hang = -1)

# Divisive
diana_cluster<-diana(x=heartdata_scaled,stand=TRUE,metric = "euclidean")

fviz_dend(agnes_cluster,cex=.6,k=2)

fviz_dend(diana_cluster,cex=.6,k=2)

# Heatmaps are used to visualize hierarchical clustering of both samples and features.
# With the default heat.colors palette, low values appear in red and high values in yellow.

heatmap(heartdata_scaled)

library(gplots)
## Warning: package 'gplots' was built under R version 3.6.3
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
heatmap.2(heartdata_scaled,scale="none",col=bluered(100),trace = "none",density.info = "none")

# A more visually appealing heatmap
library(pheatmap)
## Warning: package 'pheatmap' was built under R version 3.6.3
pheatmap(heartdata_scaled, cutree_rows = 2)

# Interactive Heatmap

library(d3heatmap)
## Warning: package 'd3heatmap' was built under R version 3.6.3
d3heatmap(scale(heartdata),k_row=4,k_col=2)

Clustering Tendency Assessment

library(clustertend)
library(factoextra)
heartdata_initial<-read_excel("heart.xlsx")

heartdata_initial$Sex<-factor(heartdata_initial$Sex)
heartdata_initial$ChestPainType<-as.factor(heartdata_initial$ChestPainType)
heartdata_initial$FastingBPmorethan120<-as.factor(heartdata_initial$FastingBPmorethan120)
heartdata_initial$ExerciseInducedAngina<-as.factor(heartdata_initial$ExerciseInducedAngina)
heartdata_initial$Thal<-as.factor(heartdata_initial$Thal)
#heartdata_initial$ColoredVessels<-as.factor(heartdata_initial$ColoredVessels)
heartdata_initial$HeartAttackRisk<-as.factor(heartdata_initial$HeartAttackRisk)
heartdata_initial$RestingElectrographicResults<-as.factor(heartdata_initial$RestingElectrographicResults)

# Summarizing the data
summary(heartdata_initial)
##       Age        Sex     ChestPainType RestingBloodPressure  Cholesterol   
##  Min.   :29.00   0: 96   0:143         Min.   : 94.0        Min.   :126.0  
##  1st Qu.:47.50   1:207   1: 50         1st Qu.:120.0        1st Qu.:211.0  
##  Median :55.00           2: 87         Median :130.0        Median :240.0  
##  Mean   :54.37           3: 23         Mean   :131.6        Mean   :246.3  
##  3rd Qu.:61.00                         3rd Qu.:140.0        3rd Qu.:274.5  
##  Max.   :77.00                         Max.   :200.0        Max.   :564.0  
##  FastingBPmorethan120 RestingElectrographicResults MaximumHeartRate
##  0:258                0:147                        Min.   : 71.0   
##  1: 45                1:152                        1st Qu.:133.5   
##                       2:  4                        Median :153.0   
##                                                    Mean   :149.6   
##                                                    3rd Qu.:166.0   
##                                                    Max.   :202.0   
##  ExerciseInducedAngina    Oldpeak         Slope       ColoredVessels   Thal   
##  0:204                 Min.   :0.00   Min.   :0.000   Min.   :0.0000   0:  2  
##  1: 99                 1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000   1: 18  
##                        Median :0.80   Median :1.000   Median :0.0000   2:166  
##                        Mean   :1.04   Mean   :1.399   Mean   :0.7294   3:117  
##                        3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000          
##                        Max.   :6.20   Max.   :2.000   Max.   :4.0000          
##  HeartAttackRisk
##  0:138          
##  1:165          
##                 
##                 
##                 
## 
# Scaling the data
heartdata<-heartdata_initial[c(1,4,5,8)]
summary(heartdata)
##       Age        RestingBloodPressure  Cholesterol    MaximumHeartRate
##  Min.   :29.00   Min.   : 94.0        Min.   :126.0   Min.   : 71.0   
##  1st Qu.:47.50   1st Qu.:120.0        1st Qu.:211.0   1st Qu.:133.5   
##  Median :55.00   Median :130.0        Median :240.0   Median :153.0   
##  Mean   :54.37   Mean   :131.6        Mean   :246.3   Mean   :149.6   
##  3rd Qu.:61.00   3rd Qu.:140.0        3rd Qu.:274.5   3rd Qu.:166.0   
##  Max.   :77.00   Max.   :200.0        Max.   :564.0   Max.   :202.0
heartdata_scaled<-scale(heartdata)
summary(heartdata_scaled)
##       Age           RestingBloodPressure  Cholesterol      MaximumHeartRate 
##  Min.   :-2.79300   Min.   :-2.14525     Min.   :-2.3203   Min.   :-3.4336  
##  1st Qu.:-0.75603   1st Qu.:-0.66277     1st Qu.:-0.6804   1st Qu.:-0.7049  
##  Median : 0.06977   Median :-0.09259     Median :-0.1209   Median : 0.1464  
##  Mean   : 0.00000   Mean   : 0.00000     Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.73041   3rd Qu.: 0.47760     3rd Qu.: 0.5448   3rd Qu.: 0.7139  
##  Max.   : 2.49212   Max.   : 3.89872     Max.   : 6.1303   Max.   : 2.2856
# The visualization suggests the data have 2 main clusters:

fviz_pca_ind(prcomp(heartdata_scaled),title="Heart Attack Risk Data",habillage =heartdata_initial$HeartAttackRisk,palette = "jco",geom = "point",ggtheme=theme_classic(),legend="bottom" )

kmeans_clust<-kmeans(heartdata_scaled,2)

fviz_cluster(list(data=heartdata_scaled,cluster=kmeans_clust$cluster),ellipse.type = "norm",geom="point",stand=FALSE,palette="jco",ggtheme = theme_classic())

# Calculating the Hopkins statistic; a value well below 0.5 indicates that the data exhibit inherent cluster patterns.
hopkins(heartdata_scaled,n=nrow(heartdata_scaled)-1)
## $H
## [1] 0.2277675
# Visualizing the dissimilarity matrix, where red depicts high similarity and blue depicts low similarity

fviz_dist(dist(heartdata_scaled),show_labels = FALSE)+labs(title = "Heart Risk Data Set")

# Comparing Clustering Algorithms

library(clValid)
## Warning: package 'clValid' was built under R version 3.6.3
library(clustertend)
library(factoextra)
heartdata_initial<-read_excel("heart.xlsx")

heartdata_initial$Sex<-factor(heartdata_initial$Sex)
heartdata_initial$ChestPainType<-as.factor(heartdata_initial$ChestPainType)
heartdata_initial$FastingBPmorethan120<-as.factor(heartdata_initial$FastingBPmorethan120)
heartdata_initial$ExerciseInducedAngina<-as.factor(heartdata_initial$ExerciseInducedAngina)
heartdata_initial$Thal<-as.factor(heartdata_initial$Thal)
#heartdata_initial$ColoredVessels<-as.factor(heartdata_initial$ColoredVessels)
heartdata_initial$HeartAttackRisk<-as.factor(heartdata_initial$HeartAttackRisk)
heartdata_initial$RestingElectrographicResults<-as.factor(heartdata_initial$RestingElectrographicResults)

# Summarizing the data
summary(heartdata_initial)
##       Age        Sex     ChestPainType RestingBloodPressure  Cholesterol   
##  Min.   :29.00   0: 96   0:143         Min.   : 94.0        Min.   :126.0  
##  1st Qu.:47.50   1:207   1: 50         1st Qu.:120.0        1st Qu.:211.0  
##  Median :55.00           2: 87         Median :130.0        Median :240.0  
##  Mean   :54.37           3: 23         Mean   :131.6        Mean   :246.3  
##  3rd Qu.:61.00                         3rd Qu.:140.0        3rd Qu.:274.5  
##  Max.   :77.00                         Max.   :200.0        Max.   :564.0  
##  FastingBPmorethan120 RestingElectrographicResults MaximumHeartRate
##  0:258                0:147                        Min.   : 71.0   
##  1: 45                1:152                        1st Qu.:133.5   
##                       2:  4                        Median :153.0   
##                                                    Mean   :149.6   
##                                                    3rd Qu.:166.0   
##                                                    Max.   :202.0   
##  ExerciseInducedAngina    Oldpeak         Slope       ColoredVessels   Thal   
##  0:204                 Min.   :0.00   Min.   :0.000   Min.   :0.0000   0:  2  
##  1: 99                 1st Qu.:0.00   1st Qu.:1.000   1st Qu.:0.0000   1: 18  
##                        Median :0.80   Median :1.000   Median :0.0000   2:166  
##                        Mean   :1.04   Mean   :1.399   Mean   :0.7294   3:117  
##                        3rd Qu.:1.60   3rd Qu.:2.000   3rd Qu.:1.0000          
##                        Max.   :6.20   Max.   :2.000   Max.   :4.0000          
##  HeartAttackRisk
##  0:138          
##  1:165          
##                 
##                 
##                 
## 
# Scaling the data
heartdata<-heartdata_initial[c(1,4,5,8)]
summary(heartdata)
##       Age        RestingBloodPressure  Cholesterol    MaximumHeartRate
##  Min.   :29.00   Min.   : 94.0        Min.   :126.0   Min.   : 71.0   
##  1st Qu.:47.50   1st Qu.:120.0        1st Qu.:211.0   1st Qu.:133.5   
##  Median :55.00   Median :130.0        Median :240.0   Median :153.0   
##  Mean   :54.37   Mean   :131.6        Mean   :246.3   Mean   :149.6   
##  3rd Qu.:61.00   3rd Qu.:140.0        3rd Qu.:274.5   3rd Qu.:166.0   
##  Max.   :77.00   Max.   :200.0        Max.   :564.0   Max.   :202.0
heartdata_scaled<-scale(heartdata)
summary(heartdata_scaled)
##       Age           RestingBloodPressure  Cholesterol      MaximumHeartRate 
##  Min.   :-2.79300   Min.   :-2.14525     Min.   :-2.3203   Min.   :-3.4336  
##  1st Qu.:-0.75603   1st Qu.:-0.66277     1st Qu.:-0.6804   1st Qu.:-0.7049  
##  Median : 0.06977   Median :-0.09259     Median :-0.1209   Median : 0.1464  
##  Mean   : 0.00000   Mean   : 0.00000     Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.73041   3rd Qu.: 0.47760     3rd Qu.: 0.5448   3rd Qu.: 0.7139  
##  Max.   : 2.49212   Max.   : 3.89872     Max.   : 6.1303   Max.   : 2.2856
cluster_method<-c("hierarchical","kmeans","pam")
check<-clValid(heartdata_scaled,nClust=2:6,clMethods=cluster_method,validation="internal")
## Warning in clValid(heartdata_scaled, nClust = 2:6, clMethods = cluster_method, :
## rownames for data not specified, using 1:nrow(data)
summary(check)
## 
## Clustering Methods:
##  hierarchical kmeans pam 
## 
## Cluster sizes:
##  2 3 4 5 6 
## 
## Validation Measures:
##                                   2        3        4        5        6
##                                                                        
## hierarchical Connectivity    6.9282   9.0115  28.5460  80.0008  81.8048
##              Dunn            0.1543   0.1543   0.1072   0.0920   0.0920
##              Silhouette      0.3900   0.2925   0.2133   0.1918   0.1262
## kmeans       Connectivity   80.7409  98.0278 119.8087 162.1262 168.1238
##              Dunn            0.0260   0.0441   0.0598   0.0579   0.0699
##              Silhouette      0.2386   0.2394   0.2379   0.1949   0.2077
## pam          Connectivity   68.1496 115.7476 160.5782 163.4405 188.6448
##              Dunn            0.0313   0.0313   0.0218   0.0546   0.0463
##              Silhouette      0.2246   0.2086   0.1929   0.1769   0.1814
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 6.9282 hierarchical 2       
## Dunn         0.1543 hierarchical 2       
## Silhouette   0.3900 hierarchical 2
cluster_method<-c("hierarchical","kmeans","pam")
check_stability<-clValid(heartdata_scaled,nClust=2:6,clMethods=cluster_method,validation="stability")
## Warning in clValid(heartdata_scaled, nClust = 2:6, clMethods = cluster_method, :
## rownames for data not specified, using 1:nrow(data)
optimalScores(check_stability)
##          Score       Method Clusters
## APN 0.02777261 hierarchical        2
## AD  2.04004031       kmeans        6
## ADM 0.11080351 hierarchical        2
## FOM 0.94389095          pam        6

Market Basket Analysis using Association Rules

Objective:

**Market Basket Analysis and Association Rules**

Measuring Rule Interest

\[Support(X)=\frac{Count(X)}{N}\]

\[Confidence(X\rightarrow Y)=\frac{support(X,Y)}{support(X)}\]
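A base-R sketch of both measures on a small hypothetical set of transactions:

```r
# Hypothetical market baskets.
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("bread", "milk", "butter"),
  c("milk"),
  c("bread", "milk")
)
N <- length(transactions)

# Support(X) = Count(X) / N: the fraction of baskets containing itemset X.
support <- function(itemset) {
  mean(sapply(transactions, function(t) all(itemset %in% t)))
}

support(c("bread"))          # 4/5 = 0.8
support(c("bread", "milk"))  # 3/5 = 0.6

# Confidence(bread -> milk) = support(bread, milk) / support(bread)
support(c("bread", "milk")) / support(c("bread"))  # 0.75
```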

Association Rule Formulation Steps

  1. Identify the itemsets that meet the minimum support threshold.
  2. Construct rules from those itemsets that meet a minimum confidence threshold.
  3. The first phase evaluates the itemsets of each size, starting with i = 1.
  4. The first iteration computes the support of itemsets composed of 1 item, the second iteration of itemsets composed of 2 items, and so on.
  5. Each iteration retains the size-i itemsets that meet the minimum support threshold.
  6. The algorithm eliminates some combinations even before the next round: if {D} is infrequent in the first iteration over items A, B, C, D, then only {A,B}, {A,C}, and {B,C} will be evaluated next.
  7. If the next iteration discovers that {A,B} and {B,C} are frequent but {A,C} is not, then any further combination containing {A,C}, such as {A,B,C}, need not be considered.
  8. When no itemset can be generated for the next iteration (here, the third), the algorithm stops.
  9. Finally, in the next phase, the association rules are generated using the confidence threshold.
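The level-wise search of steps 3 to 8 can be sketched in base R. This is a simplified illustration on hypothetical transactions: the pruning here only restricts candidates to items that survive the previous round (step 6), without the full subset check of step 7, and a production analysis would use a dedicated package such as arules:

```r
transactions <- list(
  c("A", "B", "C"), c("A", "B"), c("A", "C"),
  c("B", "C"), c("A", "B", "C"), c("D")
)
min_support <- 0.3
items <- sort(unique(unlist(transactions)))

support <- function(itemset) {
  mean(sapply(transactions, function(t) all(itemset %in% t)))
}

frequent <- list()
k <- 1
candidates <- as.list(items)  # iteration 1: itemsets of size 1
while (length(candidates) > 0) {
  # Keep only candidates meeting the minimum support threshold.
  keep <- candidates[sapply(candidates, support) >= min_support]
  if (length(keep) == 0) break
  frequent[[k]] <- keep
  # Build the next round's candidates only from surviving items, so
  # combinations involving infrequent items (like {D}) are never tried.
  surviving_items <- sort(unique(unlist(keep)))
  k <- k + 1
  candidates <- if (length(surviving_items) >= k) {
    combn(surviving_items, k, simplify = FALSE)
  } else {
    list()
  }
}

str(frequent)  # frequent itemsets of sizes 1, 2, 3
```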